Stack overflow fix - eliminate recursive implementation #4997

kgururaj · 2018-07-11T00:19:14Z

Fix for #4994

The recursive implementation should not have been accepted - my mistake.
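
For readers unfamiliar with this failure mode, here is a generic sketch of why recursion whose depth grows with input size can overflow the stack, and how a loop avoids it. The code changed in this PR is not shown in this conversation, so the `ChainLength` class and its `Node` type below are hypothetical illustrations, not the GenomicsDBImport implementation.

```java
// Generic illustration only; this is NOT the GATK/GenomicsDB code changed in this PR.
class ChainLength {
    /** Hypothetical singly linked node. */
    static final class Node {
        final Node next;
        Node(Node next) { this.next = next; }
    }

    // Recursive form: one stack frame per node, so a long enough chain
    // eventually throws StackOverflowError.
    static int lengthRecursive(Node node) {
        return node == null ? 0 : 1 + lengthRecursive(node.next);
    }

    // Iterative form: constant stack depth regardless of chain length.
    static int lengthIterative(Node node) {
        int length = 0;
        for (Node cur = node; cur != null; cur = cur.next) {
            length++;
        }
        return length;
    }

    // Builds a chain of n nodes without recursion.
    static Node chainOf(int n) {
        Node head = null;
        for (int i = 0; i < n; i++) {
            head = new Node(head);
        }
        return head;
    }
}
```

The depth at which the recursive form fails depends on the JVM's thread stack size, which would explain why a reproduction can need a different input size on different machines.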

droazen · 2018-07-12T16:49:32Z

I'd suggest doing a bug fix release @lbergelson once this fix goes in.

kgururaj · 2018-07-12T17:55:50Z

Tested with 10K intervals and 100 WES samples

codecov-io · 2018-07-12T17:55:50Z

Codecov Report

Merging #4997 into master will increase coverage by 0.03%.
The diff coverage is 100%.

@@              Coverage Diff               @@
##              master     #4997      +/-   ##
==============================================
+ Coverage     86.367%   86.397%   +0.03%     
- Complexity     28832     28945     +113     
==============================================
  Files           1791      1791              
  Lines         133603    134348     +745     
  Branches       14920     15063     +143     
==============================================
+ Hits          115389    116073     +684     
- Misses         12810     12858      +48     
- Partials        5404      5417      +13

Impacted Files	Coverage Δ	Complexity Δ
...ls/genomicsdb/GenomicsDBImportIntegrationTest.java	`91.089% <100%> (+0.157%)`	`75 <2> (+2)`	⬆️
...dinstitute/hellbender/utils/R/RScriptExecutor.java	`80.282% <0%> (-8.451%)`	`17% <0%> (-3%)`
...g/broadinstitute/hellbender/utils/io/Resource.java	`55.556% <0%> (-7.407%)`	`6% <0%> (-1%)`
...llbender/tools/walkers/bqsr/AnalyzeCovariates.java	`67.593% <0%> (-4.63%)`	`29% <0%> (-1%)`
...ute/hellbender/utils/recalibration/RecalUtils.java	`89.407% <0%> (-3.814%)`	`52% <0%> (-1%)`
.../sv/integration/SVIntegrationTestDataProvider.java	`90.909% <0%> (-3.209%)`	`2% <0%> (+1%)`
...te/hellbender/tools/spark/sv/utils/SVInterval.java	`85.507% <0%> (-1.993%)`	`64% <0%> (+28%)`
...walkers/bqsr/AnalyzeCovariatesIntegrationTest.java	`92.063% <0%> (-1.783%)`	`22% <0%> (-2%)`
...on/FindBreakpointEvidenceSparkIntegrationTest.java	`100% <0%> (ø)`	`11% <0%> (+4%)`	⬆️
...spark/sv/evidence/BreakpointDensityFilterTest.java	`100% <0%> (ø)`	`28% <0%> (+11%)`	⬆️
... and 9 more

cmnbroad · 2018-07-18T15:58:09Z

@lbergelson Do you want to look at this ? I can do a release tomorrow if we can get it in.

cmnbroad

One request for a test.

kgururaj · 2018-07-20T23:32:38Z

How do you want to test this? The error was triggered only if a large number of intervals (~1000) was imported by the tool.

cmnbroad · 2018-07-23T13:24:22Z

@kgururaj It looks like there are existing integration tests that use intervals that cover a pretty wide genomic range. It should be easy to write a test that programmatically generates a large set (10000 or so) of very small intervals (1bp) with small (1bp) gaps between them (the gaps are necessary since otherwise the intervals will be merged together by the engine), and that fails without this change and passes with it. It doesn't necessarily have to verify the results, just successfully complete.
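
A minimal sketch of the interval generation described above, assuming intervals are written in the `contig:start-end` string form. The class name `SmallIntervalGenerator` and the contig `chr20` are hypothetical; this is not the test that was later added to the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the test added in this PR.
class SmallIntervalGenerator {
    /**
     * Returns `count` intervals of 1bp on `contig`, formatted as contig:start-end.
     * Consecutive intervals are separated by a 1bp gap so they are not adjacent
     * and therefore would not be merged into one interval.
     */
    static List<String> generate(String contig, int count) {
        List<String> intervals = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            int pos = 1 + 2 * i; // 1, 3, 5, ...: the skipped even positions are the gaps
            intervals.add(contig + ":" + pos + "-" + pos);
        }
        return intervals;
    }

    public static void main(String[] args) {
        List<String> intervals = generate("chr20", 10000);
        System.out.println(intervals.size() + " intervals, first = " + intervals.get(0));
    }
}
```

The chosen contig must be long enough to hold the last position, which is `2 * count - 1`.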

kgururaj · 2018-07-23T19:13:31Z

FYI: the test will take a long time to run.

Added the requested test.

cmnbroad · 2018-07-23T21:56:19Z

@kgururaj Thanks for adding the test. Running it locally on my laptop (without your fix) succeeds though - I have to bump it up from 1000 intervals to 9000 to reproduce the stack overflow. But if I do that, it takes a long time to run, since it appears to be creating lots of small partitions. Is there any way to get it to use fewer partitions in a case like this where there are lots of intervals ?

Somewhat more concerning is that with 8000 intervals, I see a different failure mode. First I see lots (thousands) of these messages:

[GenomicsDB::VariantStorageManager] INFO: ignore message "[TileDB::StorageManager] Error: Cannot list TileDB directory; Directory buffer overflow." in the previous line

followed by a failure that ends like this:

[TileDB::StorageManager] Error: Cannot store schema; Too many open files in system. libc++abi.dylib: terminating with uncaught exception of type LoadOperatorException: LoadOperatorException : Could not define TileDB array TileDB error message : [TileDB::StorageManager] Error: Cannot store schema; Too many open files in system

Can you reproduce that ?

kgururaj · 2018-07-23T23:06:01Z

> Is there any way to get it to use fewer partitions in a case like this where there are lots of intervals ?

No - see the PR message #4645 (comment)

The multi-interval support in GenomicsDBImport tool is purely for convenience. For scalability with a large number of intervals and samples, you should use multiple processes, each writing to a small (1?) number of intervals.

> Somewhat more concerning is that with 8000 intervals, I see a different failure mode. First I see lots (thousands) of these messages:

The first set of messages are spurious debug messages - no error in reality. I'll provide a jar without these messages.

The second set of messages are a result of too many file handles open per process - your system is limiting the number of file handles opened by a single process. Again, this goes back to the previous statement.
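
The advice above (several processes, each handling a small number of intervals) implies splitting the interval list into batches first. A minimal sketch of that splitting step follows, assuming the intervals are held in a plain list. `IntervalBatcher` is a hypothetical name, and how each batch is handed to a separate import process is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a list of intervals into fixed-size batches,
// one batch per import process. Not part of GATK or GenomicsDB.
class IntervalBatcher {
    static <T> List<List<T>> batches(List<T> items, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            result.add(new ArrayList<>(items.subList(start, end)));
        }
        return result;
    }
}
```

With a batch size of 1 this yields one interval per process, the smallest unit mentioned above. Fewer intervals per process should also mean fewer files open at once in each process, though the exact relationship is not stated in this thread.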

cmnbroad · 2018-07-24T12:49:51Z

@kgururaj Do those first messages originate here ? I'm not sure what GenomicsDB code path leads there, but it looks like TileDB considers them to be an error that results in a short-circuit return. I'd be concerned that masking them would hide some underlying error. Are there any tests that verify that data round-trips through GDB when an interval list large enough to trigger these messages is encountered ?

kgururaj · 2018-07-24T15:04:56Z

Yes, that's the error message. You should see this part of the GenomicsDB code that retries the function when an error is detected. So, you don't need to be concerned by those error messages. They are annoying though and I will disable them from being printed.

cmnbroad · 2018-07-25T19:27:20Z

@kgururaj Thanks for the response(s). The test in the PR was super useful as a temporary test, but as you mentioned it runs pretty slowly, and as it stands the test passes on current master anyway. It seems to require on the order of 9000-10000 intervals rather than 1000 to actually hit the stack overflow. Since that would be a very slow running test, I'm inclined to back it out.

Also, the user who originally reported the issue was using 11k intervals, and it seems that the stack overflow fix is unlikely to help in that case. Is there any guidance for users on what is a reasonable number of intervals per process ? It sounds like the intention was that it be used with pretty small intervals. Should we issue a warning message in GenomicsDBImport at some threshold number of intervals ?

Are you planning to produce a jar with the error messages suppressed for this PR?

kgururaj · 2018-07-26T17:00:39Z

It's your call whether you want the test or not

Yeah, it would be good to put an advisory message if the number of intervals is more than 100.

I uploaded a jar yesterday, but it hasn't shown up in Maven Central yet.
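
The advisory suggested above could look roughly like the following sketch. The threshold of 100 comes from this comment. The class name, method name, and message wording are hypothetical and are not taken from GenomicsDBImport.

```java
import java.util.Optional;

// Hypothetical sketch of an interval-count advisory; not the GATK implementation.
class IntervalCountAdvisory {
    // Threshold suggested in this thread, not a value read from the GATK code base.
    static final int ADVISORY_THRESHOLD = 100;

    /** Returns an advisory message when the interval count exceeds the threshold. */
    static Optional<String> advisoryFor(int intervalCount) {
        if (intervalCount > ADVISORY_THRESHOLD) {
            return Optional.of("A large number of intervals (" + intervalCount
                    + ") was specified. Consider using multiple processes, each importing"
                    + " a small number of intervals.");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Print the advisory to stderr only when one applies.
        advisoryFor(11000).ifPresent(System.err::println);
    }
}
```

Returning the message instead of logging it directly keeps the check easy to unit test and lets the caller decide which logger to use.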

kgururaj · 2018-07-26T18:20:20Z

New jar without spurious debug messages

cmnbroad · 2018-07-27T18:30:07Z

@kgururaj Thx for the updated jar. Can you remove the test commit now, and then we can run this on travis once more and then we can merge ? Thx.

Use jar without TileDB verbose logging - error messages are printed out by GenomicsDB when required

kgururaj · 2018-07-27T20:56:03Z

Done

cmnbroad · 2018-07-27T21:35:22Z

@kgururaj It looks like the test is still included in the PR - I think it should come out since it doesn't use enough intervals to test the fix, and it would be too slow if it did.

kgururaj · 2018-07-27T22:10:53Z

It's no longer a test - just a private function. I just kept it in case somebody wishes to make it a good test in the future.

cmnbroad

Thanks @kgururaj!

droazen requested a review from lbergelson July 12, 2018 16:49

droazen assigned lbergelson Jul 12, 2018

kgururaj added the GenomicsDB label Jul 12, 2018

cmnbroad requested changes Jul 20, 2018

kgururaj force-pushed the stack_ovf_fix branch from 181dabe to d522499 Compare July 20, 2018 23:30

kgururaj force-pushed the stack_ovf_fix branch from d522499 to 7948146 Compare July 23, 2018 19:12

Stack overflow fix - eliminate recursive implementation

f3f704e

Use jar without TileDB verbose logging - error messages are printed out by GenomicsDB when required

kgururaj force-pushed the stack_ovf_fix branch from 63b88b9 to f3f704e Compare July 27, 2018 20:54

cmnbroad approved these changes Jul 27, 2018

cmnbroad merged commit 2347646 into master Jul 27, 2018

cmnbroad deleted the stack_ovf_fix branch July 27, 2018 22:29

cmnbroad mentioned this pull request Jul 27, 2018

GenomicsDBImport should issue a warning when a large number of intervals is used #5066

Closed
